Automatic Hidden-Web Table Interpretation by Sibling Page Comparison

Authors

  • Cui Tao
  • David W. Embley
Abstract

The longstanding problem of automatic table interpretation still eludes us. Its solution would not only aid table-processing applications such as large-volume table conversion, but would also help solve related problems such as information extraction and semi-structured data management. In this paper, we offer a conceptual modeling solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. We compare them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2,000 tables in source pages from three different domains: car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns, if necessary, as it processes a sequence of hidden-web pages. For these activities, the system achieved an overall F-measure of 94.5%.
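The core idea of sibling page comparison can be illustrated with a minimal sketch. It is not the authors' system: it assumes two sibling tables that are already aligned cell by cell, and the helper names `classify_cells` and `interpret`, as well as the sample car-advertisement rows, are hypothetical. Cells whose text is identical on both pages are treated as category labels, and cells whose text differs are treated as data values.

```python
from typing import Dict, List, Tuple

Table = List[List[str]]  # a table as rows of cell strings (assumed already aligned)


def classify_cells(page_a: Table, page_b: Table) -> Tuple[list, list]:
    """Split cell positions into labels (same text on both pages)
    and values (different text on the two pages)."""
    labels, values = [], []
    for r, (row_a, row_b) in enumerate(zip(page_a, page_b)):
        for c, (cell_a, cell_b) in enumerate(zip(row_a, row_b)):
            if cell_a.strip() == cell_b.strip():
                labels.append((r, c, cell_a.strip()))  # nonvarying component
            else:
                values.append((r, c))  # varying component
    return labels, values


def interpret(table: Table, labels: list, values: list) -> Dict[str, str]:
    """Connect each varying cell to the nearest label to its left in the same row."""
    label_by_pos = {(r, c): text for r, c, text in labels}
    result: Dict[str, str] = {}
    for r, c in values:
        for left in range(c - 1, -1, -1):
            if (r, left) in label_by_pos:
                result[label_by_pos[(r, left)]] = table[r][c].strip()
                break
    return result


# Hypothetical sibling pages from a car-advertisement site.
page_1 = [["Make", "Honda"], ["Year", "2004"], ["Price", "$8,500"]]
page_2 = [["Make", "Ford"], ["Year", "2001"], ["Price", "$4,200"]]

labels, values = classify_cells(page_1, page_2)
print(interpret(page_2, labels, values))
```

A two-page comparison like this misclassifies a data value as a label whenever both pages happen to share it, which is one reason a practical system would compare more sibling pages and adjust its pattern as it goes, as the abstract describes.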

Similar resources

Automatic hidden-web table interpretation, conceptualization, and semantic annotation

The longstanding problem of automatic table interpretation still eludes us. Its solution would not only aid table-processing applications such as large-volume table conversion, but would also help solve related problems such as information extraction, semantic annotation, and semi-structured data management. In this paper, we offer a solution for the common special case in w...

HTML Table Interpretation by Sibling Page Comparison in the Molecular Biology Domain

There is a large and growing amount of biological data residing in different online repositories. Many of these repositories present their data in tables. To understand these online pages automatically, a system that can interpret tables is needed. However, the longstanding problem of automatic table interpretation still eludes us [12]. We offer a solution for the common special ...

Structure based Data Extraction from Hidden Web Sources: A Review

In order to extract data from the web pages of Hidden web sources, many semi-automatic and automatic techniques have been proposed based on the structure and tags of HTML documents. These

Comparing the Application of Different Website Accessibility Evaluation Methods (Case Study: Websites of the Ministries of the Government of the Islamic Republic of Iran)

Purpose: The present research aims to compare different methods for evaluating the accessibility of websites and to analyze the results of a case study of the websites of Iranian government ministries, in order to indicate the strengths, weaknesses, and differences in the evaluation findings produced by each website accessibility evaluation method. Methodology: In this paper, initially the ...

Transforming Arbitrary Tables into F-Logic Frames with TARTAR

The tremendous success of the World Wide Web is countervailed by efforts needed to search and find relevant information. For tabular structures embedded in HTML documents typical keyword or link-analysis based search fails. The Semantic Web relies on annotating resources such as documents by means of ontologies and aims to overcome the bottleneck of finding relevant information. Turning the cur...

Publication date: 2007